Executed using command line, or a graphical user interface (GUI)
On this course, we use the RStudio GUI (www.rstudio.com)
Everything you need is installed on the training machines
If you are using your own machine, download both R and RStudio
Getting started
R is a program which, once installed on your system, can be launched and is immediately ready to take input directly from the user
There are two ways to launch R:
From the command line (particularly useful if you’re quite familiar with Linux; in the console at the prompt simply type R)
As an application called RStudio (very good for beginners)
Launching R Using RStudio
To launch RStudio, find the RStudio icon in the menu bar on the left of the screen and click
RStudio screenshot
Basic concepts in R - command line calculation
The command line can be used as a calculator. Type:
2 + 2
20/5 - sqrt(25) + 3^2
sin(pi/2)
Note: The number in the square brackets is an indicator of the position in the output. In this case the output is a ‘vector’ of length 1 (i.e. a single number). More on vectors coming up…
Basic concepts in R - variables
A variable is a letter or word which takes (or contains) a value. We use the assignment operator: <-
x <- 10
x
myNumber <- 25
myNumber
We can perform arithmetic on variables:
sqrt(myNumber)
Basic concepts in R - variables
We can add variables together:
x + myNumber
We can change the value of an existing variable:
x <- 21
x
Basic concepts in R - variables
We can set one variable to equal the value of another variable:
x <- myNumber
x
We can modify the contents of a variable:
myNumber <- myNumber + sqrt(16)
myNumber
Basic concepts in R - functions
Functions in R perform operations on arguments (the input(s) to the function). We have already used:
sin(x)
This returns the sine of x
In this case the function has one argument: x. Arguments are always contained in parentheses – curved brackets, () – separated by commas.
Try these:
sum(3,4,5,6)
max(3,4,5,6)
min(3,4,5,6)
Basic concepts in R - functions
Arguments can be named or unnamed, but if they are unnamed they must be ordered (we will see later how to find the right order)
seq(from = 2, to = 20, by = 4)
seq(2, 20, 4)
When testing code, it is easier and safer to name the arguments
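A small sketch of the point above (the variable names a, b, p and q are chosen here purely for illustration): named arguments can be given in any order, while unnamed arguments are matched by position.

```r
# Named arguments can be given in any order
a <- seq(from = 2, to = 20, by = 4)
b <- seq(by = 4, to = 20, from = 2)

# Unnamed arguments are matched by position: from, to, by
p <- seq(2, 20, 4)

# Changing the order of unnamed arguments changes the result:
# this counts down from 20 to 4 in steps of 4
q <- seq(20, 2, -4)

a
b
p
q
```

Here a, b and p all hold the same vector, whereas q does not, which is why naming arguments is the safer habit.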
Basic concepts in R - vectors
The basic data structure in R is a vector – an ordered collection of values.
R treats even single values as 1-element vectors.
The function c() combines its arguments into a vector:
x <- c(3, 4, 5, 6)
x
The square brackets [] indicate the position within the vector (the index).
We can extract individual elements by using the [] notation:
x[1]
x[4]
Basic concepts in R - vectors
We can even put a vector inside the square brackets (vector indexing):
y <- c(2, 3)
x[y]
Basic concepts in R - vectors
There are a number of shortcuts to create a vector.
Instead of:
x <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
we can write:
x <- 3:12
x
Basic concepts in R - vectors
or we can use the seq() function, which returns a vector:
x <- seq(2, 20, 4)
x
x <- seq(2, 20, length.out = 5)
x
or we can use the rep() function:
y <- rep(3, 5)
y
y <- rep(1:3, 5)
y
Basic concepts in R - vectors
We have seen some ways of extracting elements of a vector. We can use these shortcuts to make things easier (or more complex!)
x <- 3:12
# Extract elements from x:
x[3:7]
x[seq(2, 6, 2)]
x[rep(3, 2)]
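For reference, a sketch of what these three extractions return (the names first, evens and twice are made up for this example):

```r
x <- 3:12

first <- x[3:7]            # elements 3 to 7
evens <- x[seq(2, 6, 2)]   # elements 2, 4 and 6
twice <- x[rep(3, 2)]      # element 3, taken twice

first   # 5 6 7 8 9
evens   # 4 6 8
twice   # 5 5
```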
Basic concepts in R - vectors
We can add an element to a vector:
y <- c(x, 1)
y
We can glue vectors together:
z <- c(x, y)
z
Basic concepts in R - vectors
We can remove element(s) from a vector:
x <- 3:12
x[-3]
x[-(5:7)]
x[-seq(2, 6, 2)]
Basic concepts in R - vectors
Finally, we can modify the contents of a vector:
x[6] <- 4
x
x[3:5] <- 1
x
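One detail worth seeing in code (a sketch; the name dropped is hypothetical): negative indexing returns a shortened copy and leaves x itself untouched, whereas assignment with [] changes x in place.

```r
x <- 3:12

# Negative indexing returns a new vector; x is unchanged
dropped <- x[-3]
length(dropped)   # 9
length(x)         # still 10

# Assignment modifies x itself
x[6] <- 4
x[3:5] <- 1
x                 # 3 4 1 1 1 4 9 10 11 12
```

To keep the shortened vector, assign it back, e.g. x <- x[-3].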
Remember!
Square brackets [ ] for indexing
Parentheses () for function arguments
Basic concepts in R - vector arithmetic
Standard arithmetic operations are applied to vectors element-wise:
x <- 1:10
y <- x * 2
y
z <- x^2
z
Basic concepts in R - vector arithmetic
Adding two vectors:
y + z
If vectors are not the same length, the shorter one will be recycled:
x + 1:2
But be careful if the length of the longer vector is not a multiple of the length of the shorter one:
x + 1:3
Warning in x + 1:3: longer object length is not a
multiple of shorter object length
[1] 2 4 6 5 7 9 8 10 12 11
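Recycling can be made explicit with rep_len(), which repeats a vector up to a given length. The sketch below (variable names are illustrative) shows that R's recycling gives the same result as repeating the shorter vector by hand:

```r
x <- 1:10

# Recycling 1:2 is the same as repeating it to the length of x
short    <- 1:2
recycled <- rep_len(short, length(x))   # 1 2 1 2 1 2 1 2 1 2
identical(x + short, x + recycled)

# With 1:3 the last repeat is cut short, hence the warning
partial <- rep_len(1:3, length(x))      # 1 2 3 1 2 3 1 2 3 1
uneven  <- suppressWarnings(x + 1:3)
identical(uneven, x + partial)
```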
Basic concepts in R - Character vectors and naming
All the vectors we have seen so far have contained numbers, but we can also store text (“strings”) in vectors – this is called a character vector.
We can also use the names() function to get a vector of the names of an object:
names(gene.expression)
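The definition of gene.expression is not shown here, so the sketch below uses made-up gene names and values purely to illustrate a character vector and how names() is used to set and read names:

```r
# Hypothetical data: these names and values are invented for illustration
gene.names      <- c("geneA", "geneB", "geneC")   # a character vector
gene.expression <- c(0.0, 3.2, -1.5)              # a numeric vector

# Attach the names, then read them back
names(gene.expression) <- gene.names
names(gene.expression)

# Named elements can be extracted by name as well as by position
gene.expression["geneB"]
```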
Exercise: genes and genomes
Let’s try some vector arithmetic. Here are the genome lengths and number of protein coding genes for several model organisms:
Species                    Genome size (Mb)   Protein coding genes
Homo sapiens               3,102              20,774
Mus musculus               2,731              23,139
Drosophila melanogaster    169                13,937
Caenorhabditis elegans     100                20,532
Saccharomyces cerevisiae   12                 6,692
Create genome.size and coding.genes vectors to hold the data in each column using the c function. Create a species.name vector and use this vector to name the values in the other two vectors.
Exercise: genes and genomes
Let’s assume a coding gene has an average length of 1.5 kilobases. On average, how many base pairs of each genome is made of coding genes? Create a new vector to record this, called coding.bases.
What percentage of each genome is made up of protein coding genes? Use your coding.bases and genome.size vectors to calculate this. (See earlier slides for how to do division in R.)
How many times more bases are used for coding in the human genome compared to the yeast genome? (S. cerevisiae) How many times more bases are in the human genome in total compared to the yeast genome? Look up indices of your vectors to find out.
H. sapiens M. musculus D. melanogaster
1.004545 1.270908 12.370118
C. elegans S. cerevisiae
30.798000 83.650000
Answers to genome exercise
To compare human to yeast:
coding.bases[1]/coding.bases[5]
H. sapiens
3.104304
genome.size[1]/genome.size[5]
H. sapiens
258.5
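One possible way to reach the numbers shown in this exercise (a sketch, not necessarily the course's own solution; it assumes genome sizes are in megabases and uses the abbreviated species names seen in the output, with coding.pc as the name for the percentage vector):

```r
species.name <- c("H. sapiens", "M. musculus", "D. melanogaster",
                  "C. elegans", "S. cerevisiae")
genome.size  <- c(3102, 2731, 169, 100, 12)            # in Mb
coding.genes <- c(20774, 23139, 13937, 20532, 6692)
names(genome.size)  <- species.name
names(coding.genes) <- species.name

# 1.5 kilobases per gene = 1500 bases per gene
coding.bases <- coding.genes * 1500

# Percentage of each genome made up of coding genes
coding.pc <- coding.bases / (genome.size * 1e6) * 100
coding.pc

# Human compared to yeast
coding.bases[1] / coding.bases[5]
genome.size[1] / genome.size[5]
```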
Answers to genome exercise
Names are usually carried across to the new vector. Sometimes this is what we want (as for coding.pc) but sometimes it is not (when we are comparing human to yeast). We can remove names by setting them to the special NULL value:
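A minimal sketch of removing names (the vector ratio is constructed here only for illustration):

```r
# The result keeps the name of the first vector, which is misleading here
ratio <- c("H. sapiens" = 3102) / c("S. cerevisiae" = 12)
old.name <- names(ratio)
old.name

# Setting the names to NULL removes them
names(ratio) <- NULL
ratio
```

The unname() function does the same job without modifying the original object.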
Typing lots of commands directly into R can be tedious. A better way is to write the commands to a file and then load it into R.
To create an R Markdown file, click on File → New File → R Markdown in RStudio
Markdown is an easy-to-read, easy-to-write text format often used to write HTML, readme files, etc.
A simpler (but not so informative) alternative is to use a script
This will make our analyses reproducible
Format of an R markdown file
Lines 8 - 10: plain text description
Lines 12 - 14: an R code ‘chunk’
Lines 18 to 20: another code chunk, this time producing a plot
md-format
Pressing the Knit HTML (or Knit PDF) button will create a report
See solution-exercise1.Rmd for solution to Exercise 1
All exercises have a markdown template that you can edit
Getting help
This is possibly the most important slide in the whole course!?!
To get help on any R function, type ? followed by the function name. For example:
?seq
This retrieves the syntax and arguments for the function. The help page shows the default order of arguments. It also tells you which package it belongs to.
There is typically a usage example, which you can test using the example function:
example(seq)
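The argument order can also be checked from the console. A small sketch: seq() itself is a generic function, so this looks at seq.default(), the method used for ordinary numbers (arg.names is a name made up for this example).

```r
# args() prints a function's arguments in their default order
args(seq.default)

# formals() gives the same information as a named list
arg.names <- names(formals(seq.default))
arg.names
```

The first three names are from, to and by, which is why seq(2, 20, 4) works without naming anything.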
Getting help
If you can’t remember the exact name, type ?? followed by your guess. R will return a list of possibilities:
??plot
The Packages tab in the lower-right panel of RStudio will help you locate the help pages for a particular package and its functions
Often there will be a user-guide or ‘vignette’ too
Interacting with the R console
Important – R console symbols:
; end of line (Enables multiple commands to be placed on one line of text)
# comment (indicates text is a comment and not executed)
+ command line wrap (R is waiting for you to complete an expression)
Ctrl-c or escape to clear input line and try again
Ctrl-l to clear window
Use the TAB key for command auto completion
Use up and down arrows to scroll through the command history
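The three symbols above in one short sketch (variable names are illustrative):

```r
x <- 5; y <- 2   # the semicolon separates two commands on one line

# This whole line is a comment and is not executed

# An unfinished expression carries on to the next line;
# at the console R shows a + prompt while it waits for the rest
total <- x +
  y
total
```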
R packages
R comes ready loaded with various libraries of functions called packages. For example: the function sum() is in the base package and sd(), which calculates the standard deviation of a vector, is in the stats package
There are 1000s of additional packages provided by third parties, and the packages can be found in numerous server locations on the web called repositories
The two repositories you will come across the most are:
R needs to be told to use the new functions from the installed packages. Use the library() function to load the newly installed features:
library(ggplot2) # loads ggplot functions
library(DESeq)   # loads DESeq functions
library()        # lists all the packages you've got installed
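A short sketch of two related tools, offered as a supplement rather than as part of the course material: the :: notation calls a function from a named package explicitly, and requireNamespace() checks whether a package is installed before you try to load it.

```r
# package::function makes it explicit where a function comes from
total  <- base::sum(c(2, 4, 6))
spread <- stats::sd(c(2, 4, 6))
total
spread

# Load ggplot2 only if it is installed, instead of stopping with an error
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
} else {
  message("ggplot2 is not installed; see ?install.packages")
}
```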
2. Data structures
R is designed to handle experimental data
Although the basic unit of R is a vector, we usually handle data in data frames.
A data frame is a set of observations of a set of variables – in other words, the outcome of an experiment.
For example, we might want to analyse information about a set of patients.
To start with, let’s say we have ten patients and for each one we know their name, sex, age, weight and whether they give consent for their data to be made public.
The patients data frame
We are going to create a data frame called ‘patients’, which will have ten rows (observations) and seven columns (variables). The columns must all be equal lengths.
We will explore how to construct these data from scratch.
(in practice, we would usually import such data from a file)
First_Name   Second_Name   Full_Name        Sex      Age   Weight   Consent
Adam         Jones         Adam Jones       Male     50    70.8     TRUE
Eve          Parker        Eve Parker       Female   21    67.9     TRUE
John         Evans         John Evans       Male     35    75.3     FALSE
Mary         Davis         Mary Davis       Female   45    61.9     TRUE
Peter        Baker         Peter Baker      Male     28    72.4     FALSE
Paul         Daniels       Paul Daniels     Male     31    69.9     FALSE
Joanna       Edwards       Joanna Edwards   Female   42    63.5     FALSE
Matthew      Smith         Matthew Smith    Male     33    71.5     TRUE
David        Roberts       David Roberts    Male     57    73.2     FALSE
Sally        Wilson        Sally Wilson     Female   62    64.8     TRUE
Character, numeric and logical data types
Each column is a vector, like previous vectors we have seen, for example:
firstName secondName paste.firstName..secondName. ...
1 Adam Jones Adam Jones
2 Eve Parker Eve Parker
3 John Evans John Evans
4 Mary Davis Mary Davis
5 Peter Baker Peter Baker
6 Paul Daniels Paul Daniels
7 ...
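The code that builds this data frame is not shown above. One way it could be done, as a sketch: the vector names firstName and secondName are taken from the output, while sex, age, weight and consent are assumed names.

```r
firstName  <- c("Adam", "Eve", "John", "Mary", "Peter",
                "Paul", "Joanna", "Matthew", "David", "Sally")
secondName <- c("Jones", "Parker", "Evans", "Davis", "Baker",
                "Daniels", "Edwards", "Smith", "Roberts", "Wilson")
sex     <- c("Male", "Female", "Male", "Female", "Male",
             "Male", "Female", "Male", "Male", "Female")
age     <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
weight  <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9, 63.5, 71.5, 73.2, 64.8)
consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)

# Each vector becomes one column; paste() joins the two name vectors
patients <- data.frame(firstName, secondName,
                       paste(firstName, secondName),
                       sex, age, weight, consent)
dim(patients)     # 10 rows, 7 columns
names(patients)   # column names inferred from the vectors and commands
```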
Naming data frame variables
We can access particular variables using the ‘$’ operator:
patients$age
R has inferred the names of our data frame variables from the names of the vectors or the commands (e.g. the paste() command)
We can name the variables after we have created a data frame using the names() function, and we can use the same function to see the names:
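A minimal sketch on a cut-down, two-row data frame (df and old.names are illustrative names):

```r
df <- data.frame(firstName = c("Adam", "Eve"), age = c(50, 21))

# See the current names
old.names <- names(df)
old.names

# Replace them with a vector of new names, one per column
names(df) <- c("First_Name", "Age")
names(df)
df$Age
```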
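When creating a data frame, versions of R before 4.0.0 assume by default that all character vectors are categorical variables and convert them to factors (from R 4.0.0 onwards, character vectors are left as they are unless stringsAsFactors = TRUE is set). Converting to factors is not always what we want: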
e.g. we are unlikely to be interested in the hypothesis that people called Adam are taller, so it seems a bit silly to represent this as a factor
patients$First_Name
Factors in data frames
We can avoid this by asking R not to treat strings as factors, and then explicitly stating when we want a factor by using factor():
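A sketch of this on a cut-down, three-row version of the patients data frame (the stringsAsFactors argument is written out so that the code behaves the same in old and new versions of R):

```r
patients <- data.frame(First_Name = c("Adam", "Eve", "John"),
                       Sex        = c("Male", "Female", "Male"),
                       stringsAsFactors = FALSE)

# Names stay as plain text
class(patients$First_Name)

# Sex is a genuine category, so convert it explicitly
patients$Sex <- factor(patients$Sex)
class(patients$Sex)
levels(patients$Sex)
```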
Some calculations are more efficient to do on matrices, e.g.:
rowMeans(e)
[1] 3.5 4.5 5.5 6.5 7.5
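The definition of e is not shown here. As an assumption, a 5 × 2 matrix filled with the numbers 1 to 10 reproduces the output above:

```r
# Assumed definition: matrix() fills column by column
e <- matrix(1:10, nrow = 5, ncol = 2)
e

row.avg <- rowMeans(e)   # one mean per row
col.avg <- colMeans(e)   # one mean per column
row.avg
col.avg
```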
Indexing data frames and matrices
You can index multidimensional data structures like matrices and data frames using commas:
object[rows, columns]
e[1,2]
e[1,]
patients[1,2]
patients[1,]
If you don’t provide an index for either rows or columns, all of the rows or columns will be returned.
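A sketch of these indexing forms on small stand-in objects (m and df are hypothetical, not the e and patients objects from the slides):

```r
m <- matrix(1:10, nrow = 5, ncol = 2)

m[1, 2]   # row 1, column 2: a single value
m[1, ]    # all of row 1
m[, 2]    # all of column 2

df <- data.frame(name = c("Adam", "Eve"), age = c(50, 21))

df[1, 2]       # row 1, column 2
df[1, ]        # all of row 1 (still a data frame)
df[, "age"]    # columns can also be selected by name
```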
Advanced indexing
‘Values’ in R are really vectors
Indices are actually vectors, and can be numeric or logical:
s <- letters[1:5]
s
[1] "a" "b" "c" "d" "e"
# View some of the values in s:
s[c(1,3)]
s[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
Advanced indexing
We can do the logical test and indexing in the same line of R code
R will do the test first, and then use the vector of TRUE and FALSE values to subset the vector
a <- 1:5
# Logical tests:
a < 3
[1] TRUE TRUE FALSE FALSE FALSE
s[a < 3]
[1] "a" "b"
Operators
Operators allow us to combine multiple logical tests
comparison operators <, >, <=, >=, ==, !=
logical operators !, &, |, xor
Comparison and logical operators always return logical values, i.e. TRUE or FALSE
s
[1] "a" "b" "c" "d" "e"
a
[1] 1 2 3 4 5
s[a > 1 & a < 3]
s[a == 2]
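A sketch showing each logical operator in turn, using the same s and a (the result names are made up for this example):

```r
s <- letters[1:5]
a <- 1:5

both    <- s[a > 1 & a < 3]       # AND: both tests must be TRUE
either  <- s[a < 2 | a > 4]       # OR: at least one test is TRUE
negated <- s[!(a == 2)]           # NOT: flips TRUE and FALSE
only1   <- s[xor(a > 1, a < 3)]   # XOR: exactly one test is TRUE

both      # "b"
either    # "a" "e"
negated   # "a" "c" "d" "e"
only1     # "a" "c" "d" "e"
```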
Exercise: exercise2.Rmd
The markdown template has code to create the patients data frame from the slides
Make a new data frame with three extra variables: country, continent, and height
Make up the data
Make country a character vector but continent a factor
Try the summary() function on your data frame. What does it do? How does it treat vectors (numeric, character, logical) and factors? (What does it do for matrices?)
Use logical indexing to select the following patients from the data frame described in the slides:
Patients under 40
Patients who give consent to share their data
Men who weigh as much or more than the average European male (70.8 kg)
Logical indexing answers: solution-exercise2.pdf
Patients under 40:
patients[patients$Age < 40, ]
Patients who give consent to share their data:
patients[patients$Consent == TRUE, ]
Men who weigh as much or more than the average European male (70.8 kg):
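The answer for this last selection is not shown here. One way to write it, as a sketch on a cut-down copy of the patients table (only the columns needed are included; heavy.men is a hypothetical name):

```r
patients <- data.frame(
  Full_Name = c("Adam Jones", "Eve Parker", "John Evans", "Mary Davis",
                "Peter Baker", "Paul Daniels", "Joanna Edwards",
                "Matthew Smith", "David Roberts", "Sally Wilson"),
  Sex    = c("Male", "Female", "Male", "Female", "Male",
             "Male", "Female", "Male", "Male", "Female"),
  Weight = c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9, 63.5, 71.5, 73.2, 64.8),
  stringsAsFactors = FALSE
)

# Combine two tests with &: male AND weight of at least 70.8 kg
heavy.men <- patients[patients$Sex == "Male" & patients$Weight >= 70.8, ]
heavy.men
```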
The names of the columns are automatically assigned:
colnames(rawData)
[1] "Patient" "Nuclei" "NB_Amp" "NB_Nor" "NB_Del"
We can use any of these names to access a particular column and create a vector:
rawData$Nuclei
TOP TIP: type the name of the object and hit TAB: you can select the column from the drop-down list!
Word of caution
Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others)
Word of caution
You will make your life a lot easier if you keep your data tidy:
If given a single vector as an argument, the function plot() will make a scatter plot with the values of the vector on the y axis and their indices on the x axis
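A quick sketch with a made-up vector v. The xy.coords() call is included only to show the coordinates that plot() works out for itself:

```r
v <- c(3, 1, 4, 1, 5)

# Values on the y axis, positions 1 to 5 on the x axis
plot(v)

# The same x and y coordinates, computed without drawing anything
coords <- xy.coords(v)
coords$x
coords$y
```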
In the course folder you will find the file ozone.csv:
Data describing weather conditions in New York City in 1973, obtained from the supplementary data to Biostatistics: A Methodology for the Health Sciences
Can also use Red Green Blue and hexadecimal values:
rgb(0.7, 0.7, 0.7) → A light grey in RGB format
"#B3B3B3" → The same light grey in hexadecimal
"#0000FF88" → A semi-transparent blue, in hexadecimal
The hexadecimal system is the standard colour notation for screen display (e.g. web pages). It gives the intensity of Red, Green and Blue using two hexadecimal digits per colour (each digit runs 0-9 then A-F), from 00 (no intensity) to FF (most intense); an optional fourth pair of digits sets the opacity
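The two notations can be converted into each other, as this sketch shows (grey, blue and blue.alpha are illustrative names):

```r
# rgb() returns the hexadecimal string for the given intensities
grey <- rgb(0.7, 0.7, 0.7)
grey

# col2rgb() goes the other way, giving intensities from 0 to 255
col2rgb(grey)

# With alpha = TRUE the opacity is reported as well
blue <- col2rgb("#0000FF88", alpha = TRUE)
blue

# Building the same semi-transparent blue from 0-255 values
blue.alpha <- rgb(0, 0, 255, 136, maxColorValue = 255)
blue.alpha
```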
Use of colours
Changing the col argument to plot() changes the colour that the points are plotted in:
plot(patients$Age, patients$Weight, col = "red")
Plotting characters
R can use a variety of plotting characters, each of which has a numeric code
plot(patients$Age, patients$Weight, pch = 16)
Plotting characters
Or you can specify a character:
plot(patients$Age, patients$Weight, pch = "X")
Size of points
Character expansion:
plot(patients$Age, patients$Weight, cex = 3)
Size of points
Character expansion:
plot(patients$Age, patients$Weight, cex = 0.2)
Colours and characters as vectors
Previously we have used a vector of length 1 as our value of colour and character
We can use a vector of any length:
the values will get recycled (re-used) so that each point gets assigned a value
We can use a pre-defined colour palette (see later)
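A sketch of both uses, with the age, weight and sex values copied from the patients table into plain vectors so the example runs on its own (cols is an illustrative name):

```r
age    <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9, 63.5, 71.5, 73.2, 64.8)
sex    <- c("Male", "Female", "Male", "Female", "Male",
            "Male", "Female", "Male", "Male", "Female")

# Two colours and two symbols are recycled across the ten points:
# red, blue, red, blue, ...
plot(age, weight, col = c("red", "blue"), pch = c(16, 17))

# This is what the recycled colour vector looks like
cols <- rep_len(c("red", "blue"), length(age))
cols

# More usefully, give one colour per point based on the data
plot(age, weight, col = ifelse(sex == "Male", "blue", "red"), pch = 16)
```

Plain recycling colours points by position only, so a vector with one value per point is usually what you want.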
Other plotting functions use the same arguments as plot()
technical explanation: the arguments are ‘inherited’
boxplot(patients$Weight ~ patients$Sex,
        xlab = "Sex",
        ylab = "Weight",
        main = "Relationship between Weight and Gender",
        col = c("blue", "yellow"))
Exercise: exercise4b.Rmd
Can you re-create the following plots? Hint:
See the breaks and freq arguments to hist (?hist) to create 50 bins and display density rather than frequency
For third plot, see the rainbow function (?rainbow)
Don’t worry too much about getting the colours exactly correct
Solutions: solution-exercise4b.pdf
plot(weather$Solar.R, weather$Ozone, col="orange", pch=16,
ylab="Ozone level", xlab="Solar Radiation",
main="Relationship between ozone level and solar radiation")
Solutions
hist(weather$Temp, col="purple", xlab="Temperature",
main="Distribution of Temperature", breaks=50:100,
freq=FALSE)
Solutions
The rainbow() function is used to create a vector of colours for the boxplot; in other words a palette:
Red, Orange, Yellow, Green, Blue, Indigo, Violet, etc.
Other palette functions available: heat.colors(), terrain.colors(), topo.colors(), cm.colors()
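The boxplot code itself is not shown here. As a sketch, R's built-in airquality data set (daily air quality measurements in New York, May to September 1973) is used below as a stand-in for the weather data frame; the choice of plotting Ozone by Month is an assumption.

```r
# One colour per box: airquality covers five months
pal <- rainbow(5)
pal

boxplot(Ozone ~ Month, data = airquality,
        col = pal,
        xlab = "Month",
        ylab = "Ozone level")
```

Swapping rainbow(5) for heat.colors(5) or topo.colors(5) changes the palette without touching the rest of the call.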